Julia's BLOG

Machine Learning: Rule Learning

2018-09-05

This post computes entropy on two datasets and applies a simple rule learning algorithm to them.

The datasets used are:

sailing-custom-python.tab

zoo-python.tab

1. Import the packages

import pandas as pd
import numpy as np
import math

2. Load the data with pandas

sailData = pd.read_table('sailing-custom-python.tab')
zooData = pd.read_table('zoo-python.tab')
zooData = zooData.drop(columns='name')
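
For readers who do not have the two .tab files at hand, here is a minimal sketch of what the loading step does. The tab-separated text below is made up for illustration (it is not the real zoo data), and `io.StringIO` stands in for a file on disk. `pd.read_table` treats tab as the default separator.

```python
import io
import pandas as pd

# Hypothetical tab-separated content; NOT the real zoo-python.tab
text = "name\thair\tlegs\ttype\nbear\tYes\t4\tmammal\ncrow\tNo\t2\tbird\n"

toy = pd.read_table(io.StringIO(text))  # tab is the default separator
toy = toy.drop(columns='name')          # drop the identifier column, as done for zooData

print(list(toy.columns))  # ['hair', 'legs', 'type']
```

The `name` column is presumably dropped because a value that is unique per row would make every `name = ...` rule trivially pure.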

3. A function to compute entropy

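Formula: $H = -\sum_{i} p_i \log_2 p_i$, where $p_i$ is the proportion of rows whose target value belongs to class $i$.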

def entropy(data, target):
    # Count how many rows fall into each class of the target column
    count = data[target].value_counts()
    dataSize = data[target].size
    entropyValue = 0
    for value in count:
        # Share of rows that belong to this class
        proportion = value / dataSize
        entropyValue -= proportion * math.log(proportion, 2)
    return entropyValue

Check that the function runs:

entropy(sailData, 'Sail')
entropy(zooData, 'type')

Output:

0.9975025463691153

2.390559682294039

The function runs as expected.
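
As a sanity check, the first value can be reproduced by hand. The sailing table printed in the note at the end of this post has 17 rows, of which 9 have Sail = yes and 8 have Sail = no. The sketch below plugs those two counts into the entropy formula without going through pandas; `counts` is typed in manually, not read from the file.

```python
import math

# Class counts read off the sailing table: 9 rows say 'yes', 8 say 'no'
counts = {'yes': 9, 'no': 8}
total = sum(counts.values())

h = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(round(h, 4))  # 0.9975
```

The result is close to 1 because the two classes are almost evenly split; a column with a single class would give 0.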

4. Find and return the majority class of the target column

def majority_class(data, targetClass):
    # Most frequent value in the target column
    counts = data[targetClass].value_counts()
    max_name = counts.idxmax()
    return max_name
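
To make the two pandas calls concrete, here is a small sketch on made-up data (the `toy` frame is hypothetical, not one of the datasets above). `value_counts()` returns a Series indexed by class label, and `idxmax()` returns the label with the largest count.

```python
import pandas as pd

# Hypothetical toy data, not one of the post's datasets
toy = pd.DataFrame({'Sail': ['yes', 'no', 'yes', 'yes', 'no']})

counts = toy['Sail'].value_counts()  # yes -> 3, no -> 2
print(counts.idxmax())               # yes
```

If two classes tie, `idxmax()` returns whichever of them comes first in `counts`.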

5. A simple rule learner

def simpler_rule_learner(data, target):
    # Keep extracting rules until every row is covered
    while data.shape[0] > 0:
        if entropy(data, target) == 0:
            # Only one class is left: emit the default rule and stop
            print("otherwise =>", majority_class(data, target))
            data = data.iloc[0:0]
        else:
            best_entropy = entropy(data, target)
            best_attribute = ''
            best_value = ''
            best_data = data

            # Look for the attribute = value subset with the lowest entropy
            for attribute in data:
                for value in data[attribute]:
                    data2 = data.loc[data[attribute] == value]

                    if entropy(data2, target) < best_entropy:
                        best_entropy = entropy(data2, target)
                        best_attribute = attribute
                        best_value = value

                        best_data = data2

            print(best_attribute, "=", best_value, "=>", majority_class(best_data, target))
            # Drop the rows covered by the new rule
            data = data.loc[data[best_attribute] != best_value]
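
The loop above also visits the target column itself and tests every row's value, including repeats. Below is a hedged sketch of a variant, not the code used in this post: `learn_rules` is a hypothetical name, it skips the target column, tries each distinct value once, returns the rules as a list instead of printing them, and falls back to the default rule when no split lowers the entropy. The sailing rows are typed in from the table printed in the note at the end of the post.

```python
import math
import pandas as pd

def entropy(data, target):
    """Base-2 entropy of the target column."""
    proportions = data[target].value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in proportions)

def learn_rules(data, target):
    """Hypothetical variant of simpler_rule_learner that returns its rules."""
    rules = []
    while len(data) > 0:
        current = entropy(data, target)
        best = None  # (entropy, attribute, value, subset)
        if current > 0:
            for attribute in data.columns.drop(target):   # never split on the target
                for value in data[attribute].unique():    # each distinct value once
                    subset = data[data[attribute] == value]
                    h = entropy(subset, target)
                    if h < (best[0] if best else current):
                        best = (h, attribute, value, subset)
        if best is None:
            # Pure data, or no split lowers the entropy: emit the default rule
            rules.append((None, None, data[target].value_counts().idxmax()))
            break
        _, attribute, value, subset = best
        rules.append((attribute, value, subset[target].value_counts().idxmax()))
        data = data[data[attribute] != value]             # drop the covered rows
    return rules

# Sailing rows copied from the table printed in the post
rows = [
    ('rainy', 'big', 'big', 'yes'), ('rainy', 'big', 'small', 'yes'),
    ('rainy', 'med', 'big', 'no'), ('rainy', 'med', 'small', 'no'),
    ('sunny', 'big', 'big', 'yes'), ('sunny', 'big', 'small', 'yes'),
    ('sunny', 'med', 'big', 'yes'), ('sunny', 'med', 'big', 'yes'),
    ('sunny', 'med', 'small', 'yes'), ('sunny', 'no', 'small', 'yes'),
    ('sunny', 'no', 'big', 'no'), ('rainy', 'med', 'big', 'no'),
    ('rainy', 'no', 'big', 'no'), ('rainy', 'no', 'big', 'no'),
    ('rainy', 'no', 'small', 'no'), ('rainy', 'no', 'small', 'no'),
    ('sunny', 'big', 'big', 'yes'),
]
sail = pd.DataFrame(rows, columns=['Outlook', 'Company', 'Sailboat', 'Sail'])

for attribute, value, label in learn_rules(sail, 'Sail'):
    if attribute is None:
        print("otherwise =>", label)
    else:
        print(attribute, "=", value, "=>", label)
```

On the sailing data this variant should pick the same five rules as the original, because ties are still resolved in favour of the first candidate found.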

Test the function:

simpler_rule_learner(sailData, 'Sail')

Output:

Company = big => yes
Outlook = rainy => no
Company = med => yes
Sailboat = small => yes
otherwise => no

simpler_rule_learner(zooData, 'type')

Output:

feathers = Yes => bird
milk = Yes => mammal
hair = Yes => insect
airborne = Yes => insect
fins = Yes => fish
legs = 8.0 => invertebrate
eggs = No => reptile
breathes = No => invertebrate
aquatic = Yes => amphibian
predator = Yes => reptile
backbone = Yes => reptile
legs = 6.0 => insect
otherwise => invertebrate

At this point the simple rule learner produces the expected output.

Note: to select the rows in which a column has a specific value, use data.loc as follows:

print(sailData)
print()
attribute = 'Outlook'
value = 'rainy'
print(sailData.loc[sailData[attribute]==value])

Output:

   Outlook Company Sailboat Sail
0    rainy     big      big  yes
1    rainy     big    small  yes
2    rainy     med      big   no
3    rainy     med    small   no
4    sunny     big      big  yes
5    sunny     big    small  yes
6    sunny     med      big  yes
7    sunny     med      big  yes
8    sunny     med    small  yes
9    sunny      no    small  yes
10   sunny      no      big   no
11   rainy     med      big   no
12   rainy      no      big   no
13   rainy      no      big   no
14   rainy      no    small   no
15   rainy      no    small   no
16   sunny     big      big  yes

   Outlook Company Sailboat Sail
0    rainy     big      big  yes
1    rainy     big    small  yes
2    rainy     med      big   no
3    rainy     med    small   no
11   rainy     med      big   no
12   rainy      no      big   no
13   rainy      no      big   no
14   rainy      no    small   no
15   rainy      no    small   no
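
The rule learner relies on two such filters: `==` to pick the rows a rule covers and `!=` to keep the rest. The sketch below, on a made-up four-row frame (`toy` is hypothetical), shows that the comparison is just a boolean mask and that its negation gives the complementary rows.

```python
import pandas as pd

# Hypothetical toy data, not one of the post's datasets
toy = pd.DataFrame({
    'Outlook': ['rainy', 'sunny', 'rainy', 'sunny'],
    'Sail': ['no', 'yes', 'no', 'yes'],
})

mask = toy['Outlook'] == 'rainy'  # boolean Series, one entry per row
covered = toy.loc[mask]           # rows matched by the rule
remaining = toy.loc[~mask]        # here the same rows as toy['Outlook'] != 'rainy'

print(list(covered.index), list(remaining.index))  # [0, 2] [1, 3]
```

Note that the filtered frames keep their original row labels, which is why the second table above still shows indices 11 to 15.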

That's all.
